
Fix the catalog, making use of rc3 fixes #92

Merged 4 commits on May 29, 2020

Conversation

@beckjake (Contributor) commented May 26, 2020

resolves #85
resolves #89
resolves #90

Of course, 0.17.0rc3 didn't exist yet when I opened this, but it does now, it fixes these problems, and this PR points to it.
If an issue was resolved by dbt-labs/dbt-core#2489, I marked it as resolved by this PR as well; peeling the consequences of the two apart sounds annoying.

This PR takes the opinionated stance that you aren't allowed to use database in Spark: it's always None. There is probably room to go further (maybe spark__generate_database_name should raise if your custom_database_name is non-None), but I think this is functional.
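Below is a rough sketch of how that stance can be enforced at the relation level. The exact shape of SparkRelation here is an assumption (the check and the error message mirror what shows up later in this thread), so treat it as illustrative rather than the precise code in this PR:

    from dataclasses import dataclass

    from dbt.adapters.base.relation import BaseRelation
    from dbt.exceptions import RuntimeException

    @dataclass(frozen=True, eq=False, repr=False)
    class SparkRelation(BaseRelation):
        def __post_init__(self):
            # Spark has a single two-level namespace, so a database that
            # is set and differs from the schema can never be honored;
            # fail fast instead of silently ignoring it.
            if self.database and self.database != self.schema:
                raise RuntimeException('Cannot set database in spark!')

A check like this runs for every relation as it is constructed, including the ones dbt builds for sources and snapshots.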

  • Remove _parse_relation: it fully duplicates other code now.
  • Make use of dbt's DEFAULT_TYPE_TESTER when creating the agate table. This prevents an issue where stats values of 1 were coerced to True in the statistics output (see the sketch after this list).
  • Override the core generate_database_name macro to return None.
  • Statistics and table owners now work, hooray.
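To illustrate the coercion issue from the second bullet: agate's default TypeTester tries Boolean before Number, and agate.Boolean happily casts the integers 0 and 1, so a stats column full of 1s is inferred as boolean. A minimal sketch (the num_rows column name is made up, and DEFAULT_TYPE_TESTER is assumed to live in dbt.clients.agate_helper as of dbt 0.17):

    import agate

    from dbt.clients.agate_helper import DEFAULT_TYPE_TESTER

    rows = [(1,), (1,)]

    # Default inference: Boolean is tried first and accepts 0/1,
    # so the column comes back as booleans.
    naive = agate.Table(rows, column_names=['num_rows'])
    print(naive.rows[0]['num_rows'])  # True

    # dbt's tester tries Number first and only accepts the strings
    # 'true'/'false' as booleans, so the value stays numeric.
    fixed = agate.Table(
        rows, column_names=['num_rows'], column_types=DEFAULT_TYPE_TESTER
    )
    print(fixed.rows[0]['num_rows'])  # 1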

@jtcohen6 (Contributor) commented

I took this for a spin, and I'm still seeing the same two errors as before:

  1. If a source or snapshot doesn't have a database/target_database defined that's equal to its source_name/schema/target_schema, parsing fails. In the example below, dbt_default is the target.schema configured in profiles.yml, and dbt_snapshot is configured from within the snapshot block.

     Error while parsing relation test_ts:
         identifier: test_ts
         schema: dbt_snapshot
         database: dbt_default
     On Spark, database should not be set. Use the schema config to set a custom schema/database for this relation.

  2. After manually setting the database/target_database param, I see the following when trying to generate docs:

     Encountered an error while generating catalog: Compilation Error
         Expected only one schema in spark _get_one_catalog
     dbt encountered 1 failure while writing the catalog

I think we still need a way to:

  • Set database = None for all nodes/relations, including sources and snapshots
  • Tell dbt to _get_one_catalog per unique schema instead of per unique database

@jtcohen6 (Contributor) commented

I checked out the latest code in this branch, and I'm still seeing similar issues as above.

Sources + snapshots

The first error case now returns Cannot set database in spark! instead of On Spark, database should not be set.... To replicate:

  • Add a source with a name (or schema) that is different from target.schema (as configured in the profile), and missing a database param
  • Or, as above but a snapshot with target_schema that is different from target.schema (from profile), and missing target_database
  • Run dbt compile

The way to resolve the error currently is by setting the database config to the same value as the schema. This double-setting shouldn't be necessary, and it's counterintuitive because we don't want users to use the database configs at all.

Catalog generation

The second error case is still Expected only one schema in spark _get_one_catalog. To replicate:

  • Add a model with a custom schema configured.
  • Run dbt docs generate

Somehow, dbt needs to know to run _get_one_catalog once per schema instead of once per database (which should now always be None). It looks like the run-once-per-database behavior is implicit within the information_schema argument... is there a way around this?

Next steps?

Ideally, the scenarios above would be included in our integration testing suite, which we know is a ways away from where it needs to be. If it would be helpful in continuing work on this branch, I can try adding to the dbt-integration-tests spark branch:

  • a source
  • a model with a custom schema

I think the highest-priority next step is figuring out if these are fixes we can make within the plugin, or if there is a lingering core issue in the way.

Commits:
  • Fix catalog generation to properly handle 1 database (None!) with many schemas
  • Update unit tests
@beckjake (Contributor, Author) commented

I've added fixes to the issues you identified.

Re: "It looks like the run-once-per-database behavior is implicit within the information_schema argument... is there a way around this?"

This happens in the adapter portion of the code, so yup!
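For the curious, here is roughly what handling it in the adapter can look like. The method names (_get_catalog_schemas, _get_one_catalog, raise_compiler_error) follow dbt 0.17's base adapter, but the body is a simplified sketch that runs sequentially where dbt actually uses a thread pool, not the exact code merged here:

    import agate

    import dbt.exceptions
    from dbt.adapters.sql import SQLAdapter

    class SparkAdapter(SQLAdapter):
        def get_catalog(self, manifest):
            # Spark has exactly one database (None), so there should be
            # exactly one information_schema entry in the map.
            schema_map = self._get_catalog_schemas(manifest)
            if len(schema_map) > 1:
                dbt.exceptions.raise_compiler_error(
                    f'Expected only one database in get_catalog, '
                    f'found {list(schema_map)}'
                )

            # Fan out once per schema instead of once per database,
            # handing _get_one_catalog a single-schema list each time.
            catalogs = []
            for info_schema, schemas in schema_map.items():
                for schema in schemas:
                    catalogs.append(
                        self._get_one_catalog(info_schema, [schema], manifest)
                    )
            return agate.Table.merge(catalogs), []

The key move is the inner loop: the per-database grouping implied by the information_schema argument collapses to a single entry, and each schema gets its own catalog query.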

@jtcohen6 (Contributor) left a comment

This works like a dream! Thank you for taking it the last mile.

Successfully merging this pull request may close these issues:

  • Re-add table owner + stats to catalog
  • Master not working with sources